-
-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cheap isFill check and add --skip-filled-tiles
option
#234
Conversation
--skip-filled-tiles
option--skip-filled-tiles
option
Testing on the planet:
|
https://github.com/onthegomap/planetiler/actions/runs/2437903128 ℹ️ Base Logs b0f634b
ℹ️ This Branch Logs 09b71b8
|
if (encoded.length > 1_000_000) { | ||
LOGGER.warn("{} {}kb uncompressed", | ||
tileFeatures.tileCoord(), | ||
encoded.length / 1024); | ||
} | ||
if (compactDb && tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD) { | ||
if (compactDb && en.containsOnlyPolygonFills()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbilger what do you think of using this here instead of the feature count threshold?
The only issue I'm thinking is there are currently seams where ocean polygons overlap where any tile containing the seam won't be just fills but they will be repeated. Another option there would be to merge these ocean polygons (implemented in #235)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msbarry hard to say w/o a benchmark for encoding but that seems a bit faster. I ran it for Australia twice on this branch and on the main branch. The downside is that the file size increases by almost 100MB since some opportunities to dedupe seem to be missed. So it depends on the goal whether to increase performance a bit or to reduce the file size.
# 1 skip-filled-tiles branch
mbtiles 1m6s cpu:10m13s gc:4s avg:9.3
read 1x(22% 15s sys:2s wait:46s done:3s)
encode 11x(68% 45s wait:12s done:3s)
write 1x(57% 38s sys:4s wait:23s)
----------------------------------------
features 3.6GB
mbtiles 1.8GB
# 2 skip-filled-tiles branch
mbtiles 1m7s cpu:10m23s gc:4s avg:9.3
read 1x(21% 14s sys:1s wait:46s done:3s)
encode 11x(68% 46s wait:12s done:3s)
write 1x(56% 38s sys:4s wait:24s)
----------------------------------------
features 3.6GB
mbtiles 1.8GB
# 3 main branch
mbtiles 1m8s cpu:10m48s gc:4s avg:9.5
read 1x(10% 7s wait:58s done:3s)
encode 11x(72% 49s wait:10s done:3s)
write 1x(55% 37s sys:3s wait:26s)
----------------------------------------
features 3.6GB
mbtiles 1.7GB
# 4 main branch
mbtiles 1m9s cpu:10m50s gc:4s avg:9.4
read 1x(10% 7s wait:58s done:3s)
encode 11x(72% 49s wait:11s done:3s)
write 1x(54% 37s sys:3s wait:26s)
----------------------------------------
features 3.6GB
mbtiles 1.7GB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, part of this might that I landed #235 but didn't merge it into this branch until just now - the water polygon seams add quite a bit of size and merging them in encode adds a bit of extra time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, after rebasing the results are almost identical, so don't have any concerns then.
# 5 skip-filled-tiles branch rebased
mbtiles 1m9s cpu:10m54s gc:4s avg:9.5
read 1x(9% 6s wait:58s done:3s)
encode 11x(72% 49s wait:10s done:3s)
write 1x(54% 37s sys:3s wait:26s)
----------------------------------------
features 3.6GB
mbtiles 1.7GB
# 6 skip-filled-tiles branch rebased
mbtiles 1m9s cpu:10m50s gc:4s avg:9.4
read 1x(10% 7s wait:59s done:3s)
encode 11x(71% 49s wait:11s done:3s)
write 1x(53% 37s sys:3s wait:26s)
----------------------------------------
features 3.6GB
mbtiles 1.7GB
(it doesn't matter at all but just to mention: file size is 10MB larger with containsOnlyPolygonFills)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use num features < 5 OR containsOnlyPolygonFills. On Australia, there are ~40k unique fill tiles, but 800k unique tiles with < 5 features. It's probably worth the slight file size increase to lower the risk of a hash collision...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cntFilled = 636515
cntLt5 = 1422225
ctFilledLt5 = 635047
cntFilledGte5 = 1468
Counted in #tileEncoder like so
...
if (en.containsOnlyPolygonFills()) {
cntFilled.incrementAndGet();
}
if (tileFeatures.getNumFeaturesToEmit() < 5) {
cntLt5.incrementAndGet();
}
if (en.containsOnlyPolygonFills() && tileFeatures.getNumFeaturesToEmit() < 5) {
ctFilledLt5.incrementAndGet();
}
if (en.containsOnlyPolygonFills() && tileFeatures.getNumFeaturesToEmit() >= 5) {
cntFilledGte5.incrementAndGet();
}
if (compactDb && en.containsOnlyPolygonFills()) {
tileDataHash = tileFeatures.generateContentHash();
} else {
tileDataHash = null;
}
...
# containsOnlyPolygonFills
select count(*) from tile_data; -- 1290598
# tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD
select count(*) from tile_data; -- 1254643
To start with, I don't fully understand the motivation for replacing tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD
by containsOnlyPolygonFills
. The threshold check seems simple enough. But I assume you want to avoid unnecessary hash code generation & lookups.
So on the one hand: with the threshold approach we hash twice, and more than required but I think it should be fast enough, and we can remove a few more dupes (~10MB less).
On the other hand with containsOnlyPolygonFills
we miss a bunch of dupes, and this check also comes at a cost - but w/o a benchmark it's hard to say which is better/worse - would assume both to be more or less identical overall.
So I'd tend to just keep tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD
. But if avoiding hash generation/lookup is the main motivation and since the few missed dupes are neglectable, I would go with if (compactDb && tileFeatures.getNumFeaturesToEmit() < MAX_FEATURES_HASHING_THRESHOLD && en.containsOnlyPolygonFills())
(i.e. first check threshold to avoid uneccessary #containsOnlyPolygonFills checks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought here was that "is fill" would be a more accurate indicator for whether a tile would be repeated a lot than the feature count heuristic. For example some future tileset where there are >5 polygons overlapping for a large portion of the planet.
For Australia (16,169,927 tiles), it looks like:
# features | # tiles | repeated tiles | repeated (but not fill) tiles |
---|---|---|---|
1 | 13,331,155 | 13,101,957 | 6,062 |
2 | 1,681,737 | 1,400,406 | 7,283 |
3 | 583,603 | 406,630 | 17,596 |
4 | 140,253 | 17,113 | 4,566 |
5 | 65,943 | 107 | 103 |
6 | 47,374 | 15 | 15 |
7 | 36,311 | 8 | 8 |
8 | 28,772 | 5 | 5 |
9 | 23,901 | 4 | 4 |
>=10 | 230,878 | 31 | 31 |
So the isFill check catches 99.8% of repeated tiles and there end up being 41,350 distinct repeated tiles, but the <5 heuristic catches nearly 100% of repeated tiles, but there end up being 810,670 (20x more) distinct tiles. So I guess it's a tradeoff of whether we'd rather get the additional 0.2% to lower the mbtiles size slightly or if we want to minimize the chances of a hash collision by deduping 20x fewer tile hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for the additional explanation and crunching more numbers.
So given that the "is fill"-check should be cheap&fast like you said, and
For example some future tileset where there are >5 polygons overlapping for a large portion of the planet.
and
isFill check catches 99.8% of repeated tiles
and
minimize the chances of a hash collision
are all very good arguments to me to go with your suggestion to use "is fill" instead - and keep the #features out of the equation entirely here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks, I was trying to decide whether isFill() || numFeatures() <= N
would be better to catch more repeated tiles, but even just using the lowest threshold of 1 increases # distinct tiles being hashed to 267k (6-7x more) and only catches 17% of the remaining tiles, so it's probably not worth it.
If we wanted to catch more of the repeated tiles, it would probably be better to expand the isFill logic to also identify if the tile contains only edge of a rectangle or segment of a vertical/horizontal line (for example if a tileset includes latitude/longitude graticules).
given that the "is fill"-check should be cheap&fast like you said
I double checked, visualvm shows 100ms on my macbook for australia and it's only taking ~100ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the "isEdge" check on lines and rectangle catches almost all of the other repeated tiles and only increases # distinct tiles to track to 53,491 (30% increase), so I'll add that in to be safe and cover future cases with graticules.
planetiler-core/src/main/java/com/onthegomap/planetiler/VectorTile.java
Outdated
Show resolved
Hide resolved
Kudos, SonarCloud Quality Gate passed! |
Implement a cheap
isFill()
check for whether an encoded vector geometry only contains fill polygons and expose--skip-filled-tiles
command line option that you can use to omit those from the output (for example if your client over-zooms parent tiles in this case).Also switch to use
isFill
instead of the feature count heuristic to decide whether to compute a tile hash in--compact-db
mode (#219). This will be useful as part of pmtiles writer #98 to decide which tiles to attempt to deduplicate.Fixes #168